Checkpointing Orchestration for Performance Improvement

Author

  • Hui Jin
Abstract

Checkpointing is a widely used mechanism for supporting fault tolerance in high performance computing (HPC), but it is notorious for its expensive disk access. Parallel file systems such as Lustre, GPFS, and PVFS are widely deployed on supercomputers to provide fast I/O bandwidth for general data-intensive applications. However, the unique characteristics of checkpointing prevent it from benefiting from parallel file systems. In addition, the design of parallel file systems introduces extra contention overhead for checkpointing and significantly degrades its performance. In this study, we propose checkpointing orchestration to mask this unnecessary overhead and improve performance. We extend Open MPI and PVFS to support checkpointing orchestration. The experimental results confirm the potential of the proposed approach.

Similar resources

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components, among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environments. In the suggested scheme, nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...

Checkpointing Schemes for Fast Restart in Main Memory

The potential for substantial performance improvement in a main memory database system (MMDB) is promising, since I/O activity is kept at a minimum. On the other hand, due to the volatility of main memory, failure recovery becomes more complex than in traditional disk-resident database systems. In this paper, we present four checkpointing schemes for the MMDB. The proposed schemes ar...

Nonblocking Checkpointing for Optimistic Parallel Simulation: Description and an Implementation

This paper describes a non-blocking checkpointing mode in support of optimistic parallel discrete event simulation. This mode allows real concurrency in the execution of state saving and other simulation-specific operations (e.g. event list update, event execution), with the aim of removing the cost of recording state information from the completion time of the parallel simulation application. ...

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...

Publication date: 2010